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Abstract 

Background: Highly mutable RNA viruses exist in infected hosts as heterogeneous populations of genetically close 
variants known as quasispecies. Next-generation sequencing (NGS) allows for analysing a large number of viral 
sequences from infected patients, presenting a novel opportunity for studying the structure of a viral population 
and understanding virus evolution, drug resistance and immune escape. Accurate reconstruction of genetic 
composition of intra-host viral populations involves assembling the NGS short reads into whole-genome sequences 
and estimating frequencies of individual viral variants. Although a few approaches were developed for this task, 
accurate reconstruction of quasispecies populations remains greatly unresolved. 

Results: Two new methods, AmpMCF and ShotMCF, for reconstruction of the whole-genome intra-host viral 
variants and estimation of their frequencies were developed, based on Multicommodity Flows (MCFs). AmpMCF 
was designed for NGS reads obtained from individual PGR amplicons and ShotMGF for NGS shotgun reads. While 
AmpMCF, based on covering formulation, identifies a minimal set of quasispecies explaining all observed reads, 
ShotMCS, based on packing formulation, engages the maximal number of reads to generate the most probable set 
of quasispecies. Both methods were evaluated on simulated data in comparison to Maximum Bandwidth and 
ViSpA, previously developed state-of-the-art algorithms for estimating quasispecies spectra from the NGS amplicon 
and shotgun reads, respectively. Both algorithms were accurate in estimation of quasispecies frequencies, especially 
from large datasets. 

Conclusions: The problem of viral population reconstruction from amplicon or shotgun NGS reads was solved 
using the MCF formulation. The two methods, ShotMCF and AmpMCF, developed here afford accurate 
reconstruction of the structure of intra-host viral population from NGS reads. The implementations of the 
algorithms are available at http://alan.cs.gsu.edu/vira.html (AmpMCF) and http://alan.cs.gsu.edu/NGS/?q=content/ 
shotmcf (ShotMCF). 
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Background 

RNA-dependent RNA-polymerases of RNA viruses are 
error prone and frequently generate mutations, accumu- 
lation of which results in a diverse intra-host viral popu- 
lation of closely related genetic variants [1,2], commonly 
termed quasispecies by virologists. 

The advent of Next-Generation Sequencing (NGS) pre- 
sented invaluable opportunity for the in-depth evaluation 
of viral quasispecies and understanding the structure of 
viral intra-host populations in unprecedented detail [3,4] . 
The application of NGS to clinical and public health set- 
tings offers prospects for significant improvement in con- 
trolling drug resistance [5] and development of novel 
therapeutics and vaccines [6] . One of the major advantages 
of NGS is in generating sequence data at a scale that 
allows not only analysis of intra-host viral variants from a 
single amplicon or recovery of the consensus full-length 
genomic sequence but also reconstruction of the popula- 
tion of full-genome quasispecies from an infected host. 

The problem of reconstruction of a structure of viral 
population formulated as quasispecies spectrum recon- 
struction problem was recently addressed in several studies 
[7-11]. Given a collection of the shotgun or amplicon 
NGS reads generated from a sample of the viral popula- 
tion, the algorithms reconstruct a set of quasispecies and 
their relative frequencies. All published algorithms are 
based on generating graphs of read overlaps and use mini- 
mum-cost flows, probabilistic methods, shortest paths, or 
maximum bandwidth to reconstruct a set of quasispecies 
from the graphs [7-11]. The accuracy of reconstruction is 
affected by the heterogeneity of intra-host viral population. 
The abundance of conserved genomic regions that extend 
beyond the maximal read length significantly restricts the 
full-genome quasispecies assembly. Indeed, even short 
conserved regions at the overlaps of reads significantly 
increase ambiguity of quasispecies reconstruction. 

Most algorithms for the quasispecies spectrum recon- 
struction implicitly assume that sequence data were 
obtained using a shotgun experiment. Although the shot- 
gun method is frequently used for reconstruction of long 
sequences and produces less distortion in frequency of 
quasispecies than the amplicon-based approach, the avail- 
able NGS error correction algorithms are most efficient 
when applied to amplicon-based data [12,13]. Additionally, 
although most quasispecies spectrum reconstruction algo- 
rithms are technically applicable to both types of data, the 
amplicon-based approaches allow for a greater control 
over the distribution of reads across the entire sequence of 
interest, resulting in a more accurate estimation of the 
structure of viral population [9,10]. 

In this paper, we consider two methods, AmpMCF 
and ShotMCF, for reconstruction of the genetic struc- 
ture of intra-host viral population using either amplicon 
or shotgun NGS reads, respectively. Both methods are 



based on the application of MultiCommodity Flow pro- 
blem (MCF) [14]. 

Methods 

MCF is a classical optimization problem that searches 
for k flows for k source-sink pairs (5/, tj) in a network N 
in order to either minimize the total cost of flows or 
maximize the total flow subject to capacity and demand 
constraints. 

Quasispecies reconstruction can be formulated as an 
optimization problem in two ways: (1) identification of 
the most probable set of quasispecies formed by the lar- 
gest subset of reads from the data, referred to as pack- 
ing formulation; and (2) identification of a minimal set 
of quasispecies explaining all observed reads, referred to 
as covering formulation. These two formulations, when 
applied to MCFs, were developed into path packing and 
path covering algorithms of ShotMCF and AmpMCF, 
respectively. 

AmpMCF algorithm 

We consider an amplicon A as a multiset of reads such 
that each read re A has the same predefined starting and 
ending position in the genome start(A) and end(A), 
respectively. Two amplicons A^, A2 are considered over- 
lapped if and only if start(Ai) < start(A2) < end(A^) < end 
(A2). A set of amplicons A - {Ai, A^} is said to be 
overlapping if and only if Ai and A^+i overlap for / = 1... 
m-1. Given an overlapping set A, we define a partial 
order < on the set of reads R = Ai\J,,,\JA^ as follows: r <r' 
if and only if reAi, r'eAi+i and r and r are consistent 
over their overlap of length li^+i = end(Ai)- start(Ai+i)+l, 
i.e., the suffix of length li,i+i of r coincides with the prefix 
of length li4+i of r'. 

Given an overlapping set A = {Ai, A^}, we con- 
struct an (m+2)-staged directed vertex-weighted read- 
graph as follows: G = {V(G) = Vi U ... U U {5, t}, E 
(G), c), where each vg V^, 1 < i < m corresponds to a 
distinct read eA^. An edge (w, v) g E(G) if and only if 
either ru <ry or w = 5, ve Ai or mgA^, v = t. We also 
define the function c: Vi U ... U ^ [0,1], where c(v) 
denotes the frequency of the read represented by v g Vi 
in amplicon A^. It is evident that every full-size quasispe- 
cies that has a sequenced read from each amplicon A^ 
corresponds to an (5, ^)-path in the graph G. 

A bipartite clique of G is defined as a set of vertices 
CQV{G) such that C^V^UV^+i for some / and every ver- 
tex from the set CnV^ is adjacent to every vertex from 
the set CnVi+i. 

Lemma 1. Consistent overlaps in amplicons A^ Ai+j 
correspond to disjoint bipartite cliques in G. 

Proof. Suppose the contrary; then there exist vertices 
V, v' G Ai and u, u' gA^+i, such that <ruf ry <ru'y ry' 
<ru, but it is not true that ry' <ru'> Since v and ru are 
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comparable but r^' and r^^' are not, the prefixes of length 
hj+i of and r^^' must not be consistent. This implies a 
contradiction with and <ri^'. 

Using this simple finding, we transform the read graph 
G into a new "forked" edge-weighted directed read- 
graph H = {V(H), E(H), d) as follows. Consider each p x 
<5^-bipartite clique C = Kp^q of G not containing vertices 
s,t, C Q for some /g {1, Add a new 

"fork" vertex Vfayj^, delete all edges of the bipartite clique 
C and add edges from the sets {{u,Vfork): CnAJ and 
{{vforh CriAi+i}. Define a new edge weight function 

d : E(H) N 2iS follows: d{uVfark) = c(u)y d(vfarkv) = c(v), 
d(su) = d(vt) = 0. Figure 1 illustrates this transformation. 

As for G, every full-size quasispecies corresponds to 
(5, ^) -path in the forked read graph H, However, H is 
(2m +1) -staged directed graph with much fewer edges 
than G: for every bipartite clique K^^^ pq edges in G are 
replaced by only p+q edges in H, Since in network flow 
problems variables usually are associated with edges, 
this reduction is highly useful for the construction of 
the fast network flows-based algorithm for the quasispe- 
cies spectrum reconstruction problem. 

The quasispecies reconstruction problem may be 
restated as the following covering problem: 

Problem 1. Given a forked read graph //, cover H 
with a set of unique (5, ^) -paths Pi with frequencies gi 
such that the total frequency of paths is minimal and 
for every directed edge (w,v)g E(H) the sum of frequen- 
cies of paths containing {u, v) is at least d(uv). 

We next reformulate Problem 1 as an MCF problem. 
Suppose that k is an upper bound for the number of 
quasispecies (k is the parameter of the algorithm analo- 
gous to the parameters of clustering algorithms such as 
k-means). Then an exact solution of Problem 1 could be 
obtained using the following Mixed Integer Linear Pro- 
gramming formulation: 

minimize ^ 

i=l,...k (1) 
(5, u) eE (H) 



s.t. 

k 

J2^uv>diuvl iu^v)eE(H) (2) 

i=l 

E ^uv= E ^L^^eV(H)\{5,t}, i=l k 

J2 /;„<l,i^ey{H), i=l k 

{v,u]eE{H] 

fL>C'iu,v)Em i=l k (5) 

4e{0,l}, (M,v)e£(H), i=l k (6) 

gl€[0,l],{u,v)€E{H), i=l k (7) 

The variables g]^^ represent the values of the flow / on 
the edge {u,v). With each flow ^ we associate a binary 
vectors/ such that for every {u,v)gE(H) 

i{C>Othenfi=l (8) 

This condition is guaranteed by the constraints (5). 
Constraints (2) and (3) are covering and flow conserva- 
tion constraints, respectively. Constraints (4) guarantee 
that flows ^ are unsplittable for every / = 1, /c, i.e. the 
edges carrying each flow form a simple {s^t)- path Pi in 
the forked read-graph //. In particular, the constraints 
imply that for every / = 1, k the values g^^ are equal 
for all edges of P/. Therefore g]^^ can be interpreted as 
values proportional to frequencies of quasispecies 

The frequency of i-th quasispecies is calculated as the 
normalized size of the i-th. flow by the formula 

^sueE{H] Ssu . . 

— 7' 
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ShotMCF algorithm 

The input is a set of distinct reads R with counts {Cy : v e 
R) and a set of candidate sequences Q = {qi, q/^} gener- 
ated using the max bandwidth method of ViSpA. We con- 
struct the directed read graph G = £) as follows: 

1) for each read ry eR aligned with the reference 
sequence add a vertex ve V; the consensus of candi- 
date sequences can be used as a reference; 

2) the directed edge {u, v) belongs to E if and only if 
some suffix of overlaps with a prefix of and the 
two reads agree inside the overlap; 

3) for each candidate sequence qi^ Q add a source 5^ 
and a sink ti. Add edges (Sj^v) and (v,ti) for each ver- 
tex VG R such that coincides with the prefix or suf- 
fix of qif respectively. 

Let a read of length / be aligned with a candidate 
sequence qt and its alignment have / mismatches (repla- 
cements, insertions and deletions). Let p^y be the prob- 
ability that read was obtained from the sequence qi. 
This probability can be estimated as 

pi = (i)\i-sy-i, (10) 

where s is the sequencing error rate, i.e. the probabil- 
ity of error per nucleotide. Note that the analogous for- 
mula was used in the quasispecies theory for the 
calculation of the probability of mutation between two 
different quasispecies [15]. 

Using the read-graph constructed above, the quasispe- 
cies frequencies estimation problem can be formulated in 
terms of MCF as follows. Each (5/, ^/)-path corresponds to 
some full-genome quasispecies, which can coincide with 
qi with a probability depending on values p\. By using p\ 
as coefficients in the MCF objective function, we arrive 
to the following formulation: 

k 

maximize p^gl (11) 

veV i=l 

s.t. 

j:gi<Cy,veV (12) 

i=l 

J2^uv=J2^^' VGV\{5i,tf}, l=l,...,fe (13) 

uveE vweE 

gly>0, uveE,i=l,...,k (14) 

Here gl^ are flow variables, gi = J^uveEglv ^re auxili- 
ary variables used for the simplicity of notations, which 



represent total flow through vertices vg K The resulted 
flow is fractional and can split, thus allowing for read 
errors and mutations. (11)-(14) is a variant of MCF, 
where vertex capacity constraints are used instead of 
edge capacity constraints. Once the problem is solved, 
the frequency of each candidate quasispecies could be 
estimated using (9). 

Results 

In order to validate the devised methods, we used reads 
simulated from experimentally identified intra-host 
HCV variants or quasispecies. 

The simulated reads were generated using individual 
1734-nt sequences derived from the E1/E2 genomic region 
of intra-host HCV variants [16]. For each run of the algo- 
rithm, quasispecies populations were generated using 10 
randomly selected sequences with randomly assigned fre- 
quencies. Quasispecies frequencies were generated accord- 
ing to uniform, geometric, and skewed distributions. 

1) In the uniform distribution all sequences have 
approximately equal frequencies, which were calcu- 
lated as normalized numbers of times each sequence 
was chosen in 1000 independent trials, where at each 
trial one of sequences was randomly chosen with an 
equal probability. 

2) In the geometric distribution frequencies form a 
geometric progression. The frequencies were calcu- 
lated by taking 10 first terms in geometric progressions 
and normalizing them. 

3) In the skewed distribution one of the sequences 
has a high frequency, while the remaining sequences 
have uniformly low frequencies (generated as in 1). 

The read lengths followed a normal distribution with 
mean value of 320nt and variance of lOnt. The number of 
reads in each simulated data set varied from 5K to 300K 
for ShotMCF and from 5K to lOOK for AmpMCF. Shot- 
gun reads were simulated using FlowSim [17]. We gener- 
ated amplicons with the length of 320nt and difference of 
250nt between starting positions of consecutive amplicons. 
The starting position of each amplicon read was chosen 
among amplicons starting positions using a uniform 
distribution. 

For each size of a data set and for each distribution type 
11 independent simulated data sets were generated, 
averages of measures of algorithms quality were calculated 
and the statistical significance of algorithms comparison 
was estimated using a Kruskal-Wallis test [18]. 

Problems formulations (l)-(7) and (11)-(14) were solved 
using the IBM ILOG CPLEX solver 12.2 (http://www.ibm. 
com/software/integration/optimization/cplex-optimizer/) 
with the default parameters. ILP for AmpMCF was solved 
in parallel on 16x Intel(R) Xeon(R) CPU X5550 2.67 GHz, 
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48 GB Memory with a running time limit 5 minutes per 
problem. LP for ShotMCF was solved in parallel on 24x 
Intel(R) Xeon(R) CPU E7450 2.40 GHz, 128 GB Memory 
to optimality. The average running time for solving LP for- 
mulation for ShotMCF varied from 30.945 seconds with a 
standard deviation 11.332 seconds for 5K reads to 352.301 
seconds with a standard deviation 56.861 seconds for 
300K reads. The average running time for solving ILP for- 
mulation for AmpMCF varied from 110.219 seconds with 
a standard deviation 106.342 seconds for 5K reads to 
126.270 seconds with a standard deviation 99.500 seconds 
for lOOK reads. 

P-values for a Kruskal-Wallis test were calculated 
using MATLAB (http://www.mathworks.com/products/ 
matlab/). 

ShotMCF algorithm 

The reconstructions obtained using ShotMCF and EM 
algorithms from ViSpA [8] were compared. It was 
shown in [8] that ViSpA with EM outperforms state-of- 
the-art algorithm SHORAH proposed in [7]. Since EM 
and ShotMCF use the same method for candidate qua- 
sispecies generation, both algorithms were evaluated for 



the accuracy of finding the distribution of quasispecies 
frequencies. Following [8] and [19], we used two mea- 
sures of accuracy: Root Mean Square Error (RMSE) and 
KuUback-Leibler divergence (KLD) [20] between the 
estimated distribution and the true distribution. KLD is 
a quasi-metric that measures the distance between two 
probability distributions P={pi, Pn) and W={wi, 
Wyi) by the following formula: 



KLD (P, W) 



Figures 2, 3 illustrates the comparison of ShotMCF 
and EM algorithms. The difference in performance 
between two algorithms is statistically significant for all 
distributions and sizes of data. The p-values of a Krus- 
kal-Wallis test are summarized in Table 1. 

ShotMCF statistically significantly outperforms EM on 
large data sets with geometric and skewed distributions, 
while the quality of EM is higher on small data sets. 
The quality of quasispecies reconstruction by EM, as 
implemented in ViSpA [8], declined with the increase in 
the dataset size for large numbers of reads, and was not 
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Figure 3 Comparison of ShotMCF and EM - KLD. 



significantly affected for ShotMCF. EM produced more 
accurate results on data sets with up to 300K reads gen- 
erated using the uniform distribution. However, the 
trend of decrease in quality of EM estimations suggests 
that ShotMCF is more accurate on the larger data sets 
generated using the uniform distribution. 

The accuracy of frequency estimation for variants with 
different abundances was analysed (Figure 4). All 



analysed sequences were partitioned into 5 groups 
according to their frequencies /: / < 0.025, 0.025 </ < 
0.05, 0.05 </< 0.1, 0.1 </< 0.2 and/> 0.2. x-axis repre- 
sents the groups and y-axis represents the average rela- 
tive error of ShotMCF for each group. Frequencies of 
high-abundance variants were estimated more accu- 
rately. The accuracy of frequencies estimation increases 
monotonically with the abundance and stabilizes 



Table 1 Statistical significance of the comparison of ShotMCF and EM 



Geometric distribution 


# of reads 


5000 


20000 


100000 


150000 


200000 


250000 


300000 


p-value RMSE 


0.000071 


0.000071 


0.000071 


0.002263 


0.000122 


0.000160 


0.000093 


p-value KLD 


0.000071 


0.000913 


0.000071 


0.038598 


0.006428 


0.016540 


0.000071 


Uniform distribution 


# of reads 


5000 


20000 


100000 


150000 


200000 


250000 


300000 


p-value RMSE 


0.000071 


0.000071 


0.000071 


0.000443 


0.000566 


0.001449 


0.005258 


p-value KLD 


0.000071 


0.000071 


0.000122 


0.000345 


0.000566 


0.001449 


0.005258 


Skewed distribution 


# of reads 


5000 


20000 


100000 


150000 


200000 


250000 


300000 


p-value RMSE 


0.000071 


0.000071 


0.027823 


0.000071 


0.000720 


0.000093 


0.000071 


p-value KLD 


0.000071 


0.000071 


0.027823 


0.000071 


0.001152 


0.001152 


0.000071 
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approximately at the abundance 0.1. The quaUty of fre- 
quency estimation increases, in general, with the num- 
ber of reads in data set for all groups. 

AmpMCF algorithm 

The reconstructions obtained using AmpMCF (k = 12) 
and the Maximum Bandwidth (MB) algorithm proposed 
in [9] were compared. Maximum bandwidth is based on 
the packing formulation of the quasispecies spectrum 
reconstruction problem, and was shown to outperform 
the algorithm for quasispecies spectrum reconstruction 
from amplicon reads proposed in [10]. The following 
measures of quality of a solution were used: 

1) RMSE 

2) Jensen-Shannon divergence (JSD). It replaces KLD 
used for ShotMCF testing, since for AmpMCF and MB 
sizes of the reconstructed quasispecies populations may 
differ from the size of the correct population. JSD differs 
from KLD divergence due to the addition of a midpoint 
and is defined as follows: 



1 1 
JSD (P, W) = -KLD (P, A^) + -KLD ( W, A^) , 

where P and S are probability distributions and 
M = ^(P+ W). 

3) Sensitivity S, which is defined as 

^ \TruePositives\ 

\TruePositives\ + \FalseNegatives\ 

4) Positive predicted value (PPV), which is defined as 

\TruePositives\ 



PPV 



\TruePositives\ + \FalsePositives\ 



Here, if CandQ is the set of quasispecies found by the 
algorithm and SimQ is the set of simulated quasispecies, 
then TruePositives = CandQnSimQ, FalseNegatives = 
SimQ\CandQ and FalsePositives = CandQ\SimQ. 
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RMSE and JSD measure the quality of quasispecies 
frequencies estimation, and Sensitivity and PPV measure 
the quaUty of assembled quasispecies. Sensitivity is a 
measure of the positive identifications, which is defined 
as the percentage of correctly assembled quasispecies 
out of the population, PPV is a measure of the negative 
identification, which is defined as the percentage of cor- 
rectly identified quasispecies over all assembled 
quasispecies. 

Figures 5, 6, 7, 8 illustrate the comparison of 
AmpMCF and Maximum Bandwidth algorithms, and 
Table 2 summarizes statistical significance of the com- 
parison of the algorithms. 

According to RMSE, AmpMCF statistically significantly 
outperforms Maximum Bandwidth for all sizes of data 
sets with the geometric distributions, and for large data 
sets with the uniform distribution. Although AmpMCF 
exceeded in accuracy Maximum Bandwidth on the 5K 
and 20K datasets with the uniform distribution, the dif- 
ference in performance was statistically insignificant, 
with p-value being slightly greater than the statistical sig- 
nificance threshold of 5%. For the skewed distribution 



the results were comparable without statistically signifi- 
cant advantage of one algorithm over the other. 

According to JSD and PPV, ShotMCF statistically sig- 
nificantly outperforms Maximum Bandwidth on the 
lOOK data sets with the geometric distribution, while 
Maximum Bandwidth had the lower JSD values on the 
20K and lOOK data sets with the skewed distribution. 
For all other measures, sizes and distributions the 
results were comparable with no statistically significant 
advantage of one algorithm over the other. The p-value 
for S could not be calculated for the 5K data sets with 
the skewed distribution, since both algorithms were 
equally sensitive on all test examples. 

So AmpMCF outperformed Maximum Bandwidth in 
quasispecies frequencies estimation for populations with 
geometric and uniform distributions, while both algo- 
rithms showed a similar performance in quasispecies 
sequence reconstruction. 

The low sensitivity of AmpMCF and Maximum Band- 
width on the 5K data set with the skewed distribution is 
associated with the erroneous reconstruction of low- 
abundance variants by both algorithms, with only a 
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Figure 5 Comparison of AmpMCF and Maximum Bandwidth - RMSE. 
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dominant variant being correctly identified. For larger 
data sets, populations with the skewed distributions 
were reconstructed much more successfully and variants 
with frequencies as low as 0.8% were detected. It should 
be also noted that low-frequency variants were detected 
with higher probability in populations with the geo- 
metric distribution (Figure 9). It suggests that the reco- 
verability of low-frequency variants depends on the 
structure of a population and that the coverage provided 
by data sets of 5K reads is insufficient for low-frequency 
variants detection, if the population contains a dominant 
high-frequency variant. 

In general, abundances of variants greatly affect their 
recoverability, with high-frequency variants being easier 
to detect (Figure 10). As above, all analysed sequences 
in Figure 10 were partitioned into 5 groups according to 
their frequencies f : f < 0.025, 0.025 < f < 0.05, 0.05 < f < 
0.1, 0.1 < f < 0.2 and f > 0.2. x-axis represents the 
groups and y-axis the probability of variant recovery in 
each group. The probabilities of detection of variants 
within each group increase with the number of reads in 
a data set. While the probability of reconstruction of a 



variant with frequency less than 2.5% from the 5K data 
set was only 0.0092, all variants with frequencies greater 
than 20% were reconstructed from 20K and lOOK data 
sets. 

The accuracy of frequency estimation for detected 
variants with different abundances is illustrated by 
Figure 11. As for ShotMCF, the accuracy of frequency 
estimation increases with the abundance and stabilizes 
approximately at the abundance 0.1. In general, the 
accuracy of frequency estimation increases with the 
number of reads in a data set for all groups except for 
the group of low-frequency variants. The small value of 
RE for low-frequency variants from the 5K data sets can 
be explained with a low detection rate of such variants, 
which renders their RE undefined. 

Discussion 

Two different network-flows based formulations applic- 
able to quasispecies frequency reconstruction problem 
were developed. The first quasispecies spectrum recon- 
struction method based on network flows (NF) was pro- 
posed in [11]. However, usage of NF in that method 
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does not allow the direct reconstruction of quasispecies 
sequences and their frequencies. Rather, it selects pairs 
of overlapping reads that belong to the same sequence 
variant. For the direct quasispecies spectrum reconstruc- 
tion the second stage of the algorithm was proposed, 
which involves finding edge-disjoint paths in the net- 
work modified according to the results of the NF stage. 
The network modification substantially increases the 
number of edges; therefore the method is computation- 
ally expensive. 

AmpMCF extends the concept developed in [11]. It 
replaces NF with MCF, which allows for joining both 
stages of algorithm from [11] in a single MCF formula- 
tion and for solving it using a single algorithm. Such 
approach is more effective and allows for increasing 
quality of the solution. Moreover, instead of increasing 
the size of the network, AmpMCF allows to decrease it, 
thus making the problem much more computationally 
tractable. 

ShotMCF extends the ViSpA algorithm described in 
[8]. The method proposed in [8] consists of two stages: 



generation of candidate quasispecies sequences from 
shotgun NGS reads using Maximum Bandwidth paths in 
the read graph and estimation of their frequencies using 
the Expectation Maximization (EM) algorithm [21]. 
ShotMCF models and solves the quasispecies frequency 
estimation problem using MCF instead of EM. Unlike 
AmpMCF and the algorithm from [11], it is a packing 
algorithm that invokes the vertex rather than edge capa- 
city constraints and does not require integer variables. 
This new method in combination with the candidate 
sequences generation algorithm from [8] presents a 
novel framework for the reliable reconstruction of qua- 
sispecies spectrum. 

The formulation for AmpMCF could not be applied to 
shotgun data since it assumes that each full-length 
sequence corresponds to a unique (s,t)-path in the read 
graph. However, this is not true for the shotgun data 
since certain sequences can be assembled from reads 
through different paths. This observation taken together 
with consideration of the structure of the read graph 
described by Lemma 1 indicates that the formulation is 
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Table 2 Statistical significance of the comparison of AmpMCF and EM 



Geometric distribution 



# of reads 


5000 


20000 


100000 


p-value RMSE 


0.001100 


0.000069 


0.000070 


p-value JSD 


0.200130 


0.742240 


0.000718 


p-value S 


0.46294 


0.11743 


0.84517 


p-value PPV 


0.66827 


0.79078 


0.037853 


Uniform distribution 


# of reads 


5000 


20000 


100000 


p-value RMSE 


0.122800 


0.061063 


0.015030 


p-value JSD 


0.339790 


0.818120 


0.742170 


p-value S 


0.34978 


0.78918 


0.89135 


p-value PPV 


0.13832 


0.89501 


0.50755 


Skewed distribution 


# of reads 


5000 


20000 


100000 


p-value RMSE 


0.469220 


0.717980 


0.224440 


p-value JSD 


0.211260 


0.004284 


0.023486 


p-value S 




0.12341 


0.39881 


p-value PPV 


0.20846 


0.53018 


0.40896 
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Figure 9 Probabilities of detection of low-frequency variants (< 
0.025) for the geometric and skewed distributions - AmpMCF. 



more suitable for amplicons. The analogue of AmpMCF 
for a shotgun data is the NF-based algorithm from [8]. 
However, as aforementioned, it is computationally 
expensive and known to be outperformed by ViSpA. 

Although the formulation of ShotMCF is applicable to 
amplicons, AmpMCF is more suitable for this task since 
ShotMCF handles only the second stage of quasispecies 
spectrum reconstruction problem, with the first stage 
being the candidate sequence generation adopted from 
ViSpA; while AmpMCF incorporates the whole problem 
into a single formulation. 

The structure of the read graph explains a better match 
of the amplicon data to the covering rather than to pack- 
ing formulation implemented by Maximum Bandwidth. 
According to Lemma 1, consistent overlaps between con- 
secutive amplicons form bipartite cliques in a read graph. 
Edges within each bipartite clique are equal in respect to 
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Figure 10 Probabilities of detection of quasispecies depending 
on their frequencies - AmpMCF. 
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Figure 11 Dependence between relative error in frequency 
estimation and an abundance of a variant - AmpMCF. 



choosing (s,t)-paths in a read graph. This leads to a large 
number of peer alternatives for quasispecies assembling, 
indicating the need to search for the most parsimonious 
solution. The NF-based formulation with parsimony as 
an objective function and without predefined flow sizes 
requires covering constraints, and, therefore, leads to the 
covering formulation. 

The advantage of ShotMCF method over EM-based 
method of ViSpA originates from enforcing uniformity 
of quasispecies coverage and using more accurate for- 
mula for the probability of emission of a given read 
from a given candidate sequence. The major advantage 
of the EM algorithm over the current version of 
ShotMCF is a greater speed and reduced requirements 
for computational resources such as computer memory 
and number of parallel processors. The reason is that 
MCF is implemented directly using linear programming 
formulation. It is expected that application of faster 
methods; e.g., based on lagrangian relaxations or Bender 
decomposition, should dramatically increase perfor- 
mance of ShotMCF. 

It should be noted that MCF formulations assume 
absence of gaps in coverage. Although such gaps inter- 
rupt the assembly of the entire sequence, the genomic 
regions covered with reads can be identified using a 
reference sequence and quasispecies can be estimated 
with MCF-based algorithms for each region 
independently. 

Conclusions 

Two novel methods were developed for the reconstruc- 
tion of the structure of viral population from the NGS 
shotgun and amplicon reads. Both methods are based 
on MCF and found suitable for the reliable assembly of 
viral quasispecies and estimation of their frequencies. 
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